Skip to content

Conversation

@pmachapman
Copy link
Collaborator

@pmachapman pmachapman commented Feb 4, 2026

Fixes: #372

Example USFM:

\id MAN
\c 1.
\v v1 Verse 1

Example warnings:

warnings: [
  "USFM versification error in project TEA, expected verse “”, actual verse “MAN 1.”, mismatch type InvalidChapterNumber (parallel corpus 6983d7b5415e5f0ee3585264, monolingual corpus 6983d7b5415e5f0ee3585260)",
  "USFM versification error in project TEA, expected verse “”, actual verse “MAN -1:v1”, mismatch type InvalidVerseNumber (parallel corpus 6983d7b5415e5f0ee3585264, monolingual corpus 6983d7b5415e5f0ee3585260)",
]

This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.81%. Comparing base (9ef0877) to head (68784ea).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #381      +/-   ##
==========================================
+ Coverage   72.75%   72.81%   +0.06%     
==========================================
  Files         424      424              
  Lines       36153    36213      +60     
  Branches     4991     4996       +5     
==========================================
+ Hits        26302    26368      +66     
+ Misses       8750     8744       -6     
  Partials     1101     1101              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pmachapman pmachapman force-pushed the invalid_verse_or_chapter branch from 39cf6ef to 5ca0162 Compare February 5, 2026 00:51
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning messages are a bit awkward. How difficult would it be to generate messages that are more appropriate for the new warnings?

@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).

Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning messages are a bit awkward. How difficult would it be to generate messages that are more appropriate for the new warnings?

Good idea. That requires a modification to Serval - see sillsdev/serval/#871

@pmachapman made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 reviewed 2 files and all commit messages, and made 5 comments.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @pmachapman).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

        }

        public UsfmVersificationError(

Rather than adding another constructor, is there way to just check the error type in CheckError() itself?


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):

                    return true;
                }
                if (!_verseRef.Value.Valid)

I'm surprised that these kinds of errors (especially the bad verse numbers) aren't captured in the verse ref's valid status somehow 🤔. It is what it is, but it'd certainly be nice if it were haha.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):

                    Type == UsfmVersificationErrorType.ExtraVerse
                    || Type == UsfmVersificationErrorType.InvalidChapterNumber
                    || Type == UsfmVersificationErrorType.InvalidVerseNumber

I wonder if we could supply an expected ref. We should have some idea of what verse or chapter is expected next while parsing. If we can make a reasonable guess, I think we should add it for consistency with our attempts for other types of errors, but if we can't no worries. If the bad chapter number is something like \c 1^, that's pretty easy to find without an expected ref, but what about something like \c text: Is that possible? That would not be quite so easy to track down.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 292 at r1 (raw file):

            // See whether the chapter number is invalid
            VerseRef verseRef = state.VerseRef.Clone();

Why do we need to do this? Does the number get changed when the state is updated so we need the raw value itself? It would be nice if we could just pass the bad verse ref to the existing constructor and check for an error in CheckError() - same with invalid verses. If we have to do it, we have to do it, but it'd be nice to have the error checking all in one place.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):

                {
                    _errors.Add(versificationError);
                    verseInError = true;

What's the particular duplicate error situation you're trying to avoid?

Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmachapman made 5 comments.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Rather than adding another constructor, is there way to just check the error type in CheckError() itself?

I did that originally, but because we don't have an expected verse, and when I passed a verseRef it met the other conditions in CheckError(), it got confused with the other warnings. I figured another constructor made sense, as I also needed the actual string value to display to the user so they could resolve it. (If we don't show the actual string value to the user, the verse ref will only store "-1" for an invalid chapter, which isn't helpful to return in an error message)


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I'm surprised that these kinds of errors (especially the bad verse numbers) aren't captured in the verse ref's valid status somehow 🤔. It is what it is, but it'd certainly be nice if it were haha.

_verseRef.Value.Valid will return false, but as it is not clear just what the cause is (i.e. there is no comparable enum to UsfmVersificationErrorType returned by VerseRef), I deal with the invalid chapter/verse error just above this logic.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I wonder if we could supply an expected ref. We should have some idea of what verse or chapter is expected next while parsing. If we can make a reasonable guess, I think we should add it for consistency with our attempts for other types of errors, but if we can't no worries. If the bad chapter number is something like \c 1^, that's pretty easy to find without an expected ref, but what about something like \c text: Is that possible? That would not be quite so easy to track down.

I tried that, but it is hard to tell what it should have been. Chapters can be skipped in a USFM file. If we were to parse the string and guess, then we might as well not throw an error and use the parsed string?


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 292 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why do we need to do this? Does the number get changed when the state is updated so we need the raw value itself? It would be nice if we could just pass the bad verse ref to the existing constructor and check for an error in CheckError() - same with invalid verses. If we have to do it, we have to do it, but it'd be nice to have the error checking all in one place.

I get the raw value and pass it into the error object so we can return it in the error message, so that the error message doesn't just say something like "invalid chapter number -1" but will say the actual value that caused the error. (the VerseRef struct does not record what the invalid chapter number value is).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

What's the particular duplicate error situation you're trying to avoid?

A missing verse error

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@ddaspit made 1 comment.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93).

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 made 5 comments and resolved 4 discussions.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I did that originally, but because we don't have an expected verse, and when I passed a verseRef it met the other conditions in CheckError(), it got confused with the other warnings. I figured another constructor made sense, as I also needed the actual string value to display to the user so they could resolve it. (If we don't show the actual string value to the user, the verse ref will only store "-1" for an invalid chapter, which isn't helpful to return in an error message)

I see, and you couldn't easily add logic there to differentiate between the cases?

That's interesting. For verses, you can still access the bad verse value via vref.Verse (even though vref.VerseNum doesn't give you any clues), but not so with chapters? It's strange to me that they are handled differently.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

_verseRef.Value.Valid will return false, but as it is not clear just what the cause is (i.e. there is no comparable enum to UsfmVersificationErrorType returned by VerseRef), I deal with the invalid chapter/verse error just above this logic.

I see. So even if you grab the verse's ValidStatus enum, it doesn't clue you in? If you're getting vref.Valid == false, then ValidStatus must be set to something other than ValidStatus.Valid - I wonder what it is 🤔


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I tried that, but it is hard to tell what it should have been. Chapters can be skipped in a USFM file. If we were to parse the string and guess, then we might as well not throw an error and use the parsed string?

Well, that is true - that could be 2.0 😆. The USFM could skip chapters or verses, but if it is skipping, then that's the sort of thing we'd want to be catching anyways since this checking should all use the versification as the gold standard. I was just thinking that we could use some of the functions in ScrVers to see what the next expected reference is (e.g. something similar to what's done in ScrVers.FirstIncludedVerse() but only looking at subsequent verses). If it's impossible to have totally uninterpretable verses/chapters like \c text or \v word123word, then I'm fine with not having an expected ref, but if it is possible, I think we ought to include something like that since I believe we can make a reasonable guess and there would be no other way for the error to be actionable without running Ctrl+F in every USFM file.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 292 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I get the raw value and pass it into the error object so we can return it in the error message, so that the error message doesn't just say something like "invalid chapter number -1" but will say the actual value that caused the error. (the VerseRef struct does not record what the invalid chapter number value is).

I see. Makes sense! Like I said above, it'd odd to me that the bad chapter value can't be accessed like the verses can, but we'll make do!


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

A missing verse error

OK, yeah, I see. I was just wondering if it were worth allowing both errors to register since they're capturing different problems. And if a user got two warnings like Missing MAT 1:2 and Invalid verse MAT 1:schmutz, that may be helpful in locating/understanding the problem.

Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmachapman made 4 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

I see, and you couldn't easily add logic there to differentiate between the cases?

I can if you think it is important to have just one constructor. I will need to pass the string value into that constructor, and update all uses of it?

For verses, you can still access the bad verse value via vref.Verse (even though vref.VerseNum doesn't give you any clues), but not so with chapters?

Correct. See https://github.com/sillsdev/libpalaso/blob/master/SIL.Scripture/VerseRef.cs#L235-L246


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):

So even if you grab the verse's ValidStatus enum, it doesn't clue you in?

Correct. These errors are only really thrown via the Trace.

ValidStatus must be set to something other than ValidStatus.Valid - I wonder what it is 🤔

ValidStatus will be ValidStatusType.OutOfRange.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Well, that is true - that could be 2.0 😆. The USFM could skip chapters or verses, but if it is skipping, then that's the sort of thing we'd want to be catching anyways since this checking should all use the versification as the gold standard. I was just thinking that we could use some of the functions in ScrVers to see what the next expected reference is (e.g. something similar to what's done in ScrVers.FirstIncludedVerse() but only looking at subsequent verses). If it's impossible to have totally uninterpretable verses/chapters like \c text or \v word123word, then I'm fine with not having an expected ref, but if it is possible, I think we ought to include something like that since I believe we can make a reasonable guess and there would be no other way for the error to be actionable without running Ctrl+F in every USFM file.

Is a missing range error acceptable for this? These will be returned if the range that is invalid is where an expected range should be (see next comment).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK, yeah, I see. I was just wondering if it were worth allowing both errors to register since they're capturing different problems. And if a user got two warnings like Missing MAT 1:2 and Invalid verse MAT 1:schmutz, that may be helpful in locating/understanding the problem.

A missing range error will often be returned if the invalid chapter of verse number is a noticeable gap (i.e. noticed by CheckError()), such as:

warnings: [
  "Invalid chapter number error in project TEA at “MAN 1.” (parallel corpus 69893d8434467c056f3542e5, monolingual corpus 69893d8334467c056f3542e1)",
  "USFM versification error in project TEA, expected verse “MAN 1:15”, actual verse “MAN -1:15”, mismatch type MissingChapter (parallel corpus 69893d8434467c056f3542e5, monolingual corpus 69893d8334467c056f3542e1)",
  "Only 7 segments were selected for training."
]
``

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 made 3 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I see, and you couldn't easily add logic there to differentiate between the cases?

I can if you think it is important to have just one constructor. I will need to pass the string value into that constructor, and update all uses of it?

For verses, you can still access the bad verse value via vref.Verse (even though vref.VerseNum doesn't give you any clues), but not so with chapters?

Correct. See https://github.com/sillsdev/libpalaso/blob/master/SIL.Scripture/VerseRef.cs#L235-L246

It's not so much a matter of having one constructor as it is a matter of separation of concerns. If there's a way to keep all the error-checking in one place, that may be clearer and more maintainable, but it is fine as-is. We could always refactor later if we extend the functionality of this class further.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

Is a missing range error acceptable for this? These will be returned if the range that is invalid is where an expected range should be (see next comment).

I think that works, yeah. As long as there's a way for the user to track down where the problem is.


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

A missing range error will often be returned if the invalid chapter of verse number is a noticeable gap (i.e. noticed by CheckError()), such as:

warnings: [
  "Invalid chapter number error in project TEA at “MAN 1.” (parallel corpus 69893d8434467c056f3542e5, monolingual corpus 69893d8334467c056f3542e1)",
  "USFM versification error in project TEA, expected verse “MAN 1:15”, actual verse “MAN -1:15”, mismatch type MissingChapter (parallel corpus 69893d8434467c056f3542e5, monolingual corpus 69893d8334467c056f3542e1)",
  "Only 7 segments were selected for training."
]
``

I see. If it's a bad chapter, we'll also get a missing chapter error. And if it's a bad verse, we have the chapter and book, so unless it's Psalm 119, it shouldn't be too hard to find haha. Thank you, Peter!

Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmachapman made 1 comment and resolved 1 discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @pmachapman).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

If there's a way to keep all the error-checking in one place, that may be clearer and more maintainable, but it is fine as-is.

I spent the past hour trying to move the logic all in CheckErrors() but it got more and more convoluted with slight change breaking more and more tests :-/ Unfortunately, it looks like the logic will stay where it is.

I can only console myself with the fact is that the logic is in the same source file.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@Enkidu93 made 2 comments.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @pmachapman).


src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

If there's a way to keep all the error-checking in one place, that may be clearer and more maintainable, but it is fine as-is.

I spent the past hour trying to move the logic all in CheckErrors() but it got more and more convoluted with slight change breaking more and more tests :-/ Unfortunately, it looks like the logic will stay where it is.

I can only console myself with the fact is that the logic is in the same source file.

Thank you, Peter!

@pmachapman pmachapman force-pushed the invalid_verse_or_chapter branch from 5ca0162 to 68784ea Compare February 11, 2026 22:05
@pmachapman pmachapman merged commit 84be71d into master Feb 11, 2026
4 checks passed
@pmachapman pmachapman deleted the invalid_verse_or_chapter branch February 11, 2026 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The USFM parser should trap the Trace warnings from VerseRef

4 participants